Abstract

We introduce a new, efficient, principled and
backpropagation-compatible algorithm for learning
a probability distribution on the weights of
a neural network, called Bayes by Backprop. It
regularises the weights by minimising a compression
cost, known as the variational free energy
or the expected lower bound on the marginal
likelihood. We show that this principled kind
of regularisation yields comparable performance
to dropout on MNIST classification. We then
demonstrate how the learnt uncertainty in the
weights can be used to improve generalisation
in non-linear regression problems, and how this
weight uncertainty can be used to drive the
exploration-exploitation trade-off in reinforcement
learning.

1. Introduction 

Plain feedforward neural networks are prone to overfitting.
When applied to supervised or reinforcement learning
problems these networks are also often incapable of
correctly assessing the uncertainty in the training data and
so make overly confident decisions about the correct class,
prediction or action. We shall address both of these concerns
by using variational Bayesian learning to introduce
uncertainty in the weights of the network. We call our
algorithm Bayes by Backprop. We suggest at least three
motivations for introducing uncertainty on the weights:
1) regularisation via a compression cost on the weights, 2) richer
representations and predictions from cheap model averaging,
and 3) exploration in simple reinforcement learning
problems such as contextual bandits.

Various regularisation schemes have been developed to
prevent overfitting in neural networks such as early stopping,
weight decay, and dropout (Hinton et al., 2012). In this
work, we introduce an efficient, principled algorithm for
regularisation built upon Bayesian inference on the weights
of the network (MacKay, 1992; Buntine and Weigend,
1991; MacKay, 1995). This leads to a simple approximate
learning algorithm similar to backpropagation
(LeCun, 1985; Rumelhart et al., 1988). We shall demonstrate
how this uncertainty can improve predictive performance
in regression problems by expressing uncertainty in regions
with little or no data, and how this uncertainty can lead to
more systematic exploration than ε-greedy in contextual
bandit tasks.

All weights in our neural networks are represented by
probability distributions over possible values, rather than having
a single fixed value as is the norm (see Figure 1). Learnt
representations and computations must therefore be robust
under perturbation of the weights, but the amount of
perturbation each weight exhibits is also learnt in a way that
coherently explains variability in the training data. Thus
instead of training a single network, the proposed method
trains an ensemble of networks, where each network has its
weights drawn from a shared, learnt probability distribution.
Unlike other ensemble methods, our method typically
only doubles the number of parameters yet trains an infinite
ensemble using unbiased Monte Carlo estimates of the
gradients.

In general, exact Bayesian inference on the weights of a
neural network is intractable as the number of parameters
is very large and the functional form of a neural network
does not lend itself to exact integration. Instead we take a
variational approximation to exact Bayesian updates. We
build upon the work of Graves (2011), who in turn built
upon the work of Hinton and Van Camp (1993). In contrast
to this previous work, we show how the gradients
of Graves (2011) can be made unbiased and further how
this method can be used with non-Gaussian priors.
Consequently, Bayes by Backprop attains performance
comparable to that of dropout (Hinton et al., 2012). Our method
is related to recent methods in deep, generative modelling
(Kingma and Welling, 2014; Rezende et al., 2014; Gregor
et al., 2014), where variational inference has been applied
to stochastic hidden units of an autoencoder. Whilst the
number of stochastic hidden units might be in the order of
thousands, the number of weights in a neural network is
easily two orders of magnitude larger, making the
optimisation problem much larger scale. Uncertainty in the hidden
units allows the expression of uncertainty about a particular
observation; uncertainty in the weights is complementary
in that it captures uncertainty about which neural network
is appropriate, leading to regularisation of the weights and
model averaging.

Figure 1. Left: each weight has a fixed value, as provided by
classical backpropagation. Right: each weight is assigned a
distribution, as provided by Bayes by Backprop.

This uncertainty can be used to drive exploration in
contextual bandit problems using Thompson sampling
(Thompson, 1933; Chapelle and Li, 2011; Agrawal and Goyal,
2012; May et al., 2012). Weights with greater uncertainty
introduce more variability into the decisions made by the
network, leading naturally to exploration. As more data are
observed, the uncertainty can decrease, allowing the
decisions made by the network to become more deterministic
as the environment is better understood.

The remainder of the paper is organised as follows:
Section 2 introduces notation and standard learning in neural
networks. Section 3 describes variational Bayesian learning
for neural networks and our contributions, Section 4
describes the application to contextual bandit problems,
whilst Section 5 contains empirical results on a
classification, a regression and a bandit problem. We conclude with
a brief discussion in Section 6.

2. Point Estimates of Neural Networks 

We view a neural network as a probabilistic model
$P(\mathbf{y}|\mathbf{x}, \mathbf{w})$: given an input $\mathbf{x} \in \mathbb{R}^p$ a neural network
assigns a probability to each possible output $\mathbf{y} \in \mathcal{Y}$, using
the set of parameters or weights $\mathbf{w}$. For classification, $\mathcal{Y}$ is
a set of classes and $P(\mathbf{y}|\mathbf{x}, \mathbf{w})$ is a categorical distribution -
this corresponds to the cross-entropy or softmax loss, when
the parameters of the categorical distribution are passed
through the exponential function then re-normalised. For
regression $\mathcal{Y}$ is $\mathbb{R}$ and $P(\mathbf{y}|\mathbf{x}, \mathbf{w})$ is a Gaussian distribution
- this corresponds to a squared loss.

Inputs $\mathbf{x}$ are mapped onto the parameters of a distribution
on $\mathcal{Y}$ by several successive layers of linear transformation
(given by $\mathbf{w}$) interleaved with element-wise non-linear
transforms.

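The weights can be learnt by maximum likelihood estimation
(MLE): given a set of training examples $\mathcal{D} = (\mathbf{x}_i, \mathbf{y}_i)_i$,
the MLE weights are given by:

$$
\begin{aligned}
\mathbf{w}^{\mathrm{MLE}} &= \arg\max_{\mathbf{w}} \log P(\mathcal{D}|\mathbf{w}) \\
&= \arg\max_{\mathbf{w}} \sum_i \log P(\mathbf{y}_i|\mathbf{x}_i, \mathbf{w}).
\end{aligned}
$$

This is typically achieved by gradient descent (e.g.,
backpropagation), where we assume that $\log P(\mathcal{D}|\mathbf{w})$ is
differentiable in $\mathbf{w}$.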

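Regularisation can be introduced by placing a prior upon
the weights $\mathbf{w}$ and finding the maximum a posteriori
(MAP) weights

$$
\begin{aligned}
\mathbf{w}^{\mathrm{MAP}} &= \arg\max_{\mathbf{w}} \log P(\mathbf{w}|\mathcal{D}) \\
&= \arg\max_{\mathbf{w}} \log P(\mathcal{D}|\mathbf{w}) + \log P(\mathbf{w}).
\end{aligned}
$$

If $\mathbf{w}$ are given a Gaussian prior, this yields L2 regularisation
(or weight decay). If $\mathbf{w}$ are given a Laplace prior, then
L1 regularisation is recovered.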
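The link between priors and penalties can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the scale parameters `sigma` and `b` are illustrative assumptions. It shows that the negative log of a zero-mean Gaussian prior differs from an L2 penalty only by a constant that does not depend on the weights, and likewise for a Laplace prior and an L1 penalty.

```python
import math

def neg_log_gaussian_prior(w, sigma=1.0):
    """-log P(w) for independent zero-mean Gaussians with std `sigma` (assumed scale)."""
    const = len(w) * 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return const + sum(x * x for x in w) / (2.0 * sigma ** 2)

def neg_log_laplace_prior(w, b=1.0):
    """-log P(w) for independent zero-mean Laplace priors with scale `b` (assumed scale)."""
    const = len(w) * math.log(2.0 * b)
    return const + sum(abs(x) for x in w) / b

def l2_penalty(w, sigma=1.0):
    """Weight-decay term implied by the Gaussian prior."""
    return sum(x * x for x in w) / (2.0 * sigma ** 2)

def l1_penalty(w, b=1.0):
    """L1 term implied by the Laplace prior."""
    return sum(abs(x) for x in w) / b
```

Because the constants cancel when two weight settings are compared, maximising the MAP objective and minimising the penalised loss pick out the same weights.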

3. Being Bayesian by Backpropagation 

Bayesian inference for neural networks calculates the
posterior distribution of the weights given the training data,
$P(\mathbf{w}|\mathcal{D})$. This distribution answers predictive queries
about unseen data by taking expectations: the predictive
distribution of an unknown label $\mathbf{y}$ of a test data item $\mathbf{x}$
is given by $P(\mathbf{y}|\mathbf{x}) = \mathbb{E}_{P(\mathbf{w}|\mathcal{D})}[P(\mathbf{y}|\mathbf{x}, \mathbf{w})]$. Each
possible configuration of the weights, weighted according to
the posterior distribution, makes a prediction about the
unknown label given the test data item $\mathbf{x}$. Thus taking an
expectation under the posterior distribution on weights is
equivalent to using an ensemble of an uncountably infinite
number of neural networks. Unfortunately, this is
intractable for neural networks of any practical size.
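Although the exact expectation is intractable, it can be approximated by averaging the predictions of a finite number of sampled networks. The sketch below is a toy illustration under stated assumptions: a one-weight logistic model and a Gaussian stand-in for the posterior with hypothetical parameters `mu` and `sigma`. Neither is taken from the text.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predictive_probability(x, mu, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_{w ~ N(mu, sigma^2)}[P(y=1 | x, w)].

    The Gaussian over w is an assumed stand-in for the posterior P(w | D).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = rng.gauss(mu, sigma)   # one member of the ensemble
        total += sigmoid(w * x)    # that member's prediction
    return total / n_samples
```

With `sigma = 0` the estimate collapses to the single point prediction; with larger `sigma` the averaged prediction in this toy model moves towards 0.5, reflecting the added uncertainty.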

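Previously Hinton and Van Camp (1993) and Graves
(2011) suggested finding a variational approximation to the
Bayesian posterior distribution on the weights. Variational
learning finds the parameters $\theta^\star$ of a distribution on the
weights $q(\mathbf{w}|\theta)$ that minimises the Kullback-Leibler (KL)
divergence with the true Bayesian posterior on the weights:

$$
\begin{aligned}
\theta^\star &= \arg\min_\theta \mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w}|\mathcal{D})] \\
&= \arg\min_\theta \int q(\mathbf{w}|\theta) \log \frac{q(\mathbf{w}|\theta)}{P(\mathbf{w}) P(\mathcal{D}|\mathbf{w})} \, d\mathbf{w} \\
&= \arg\min_\theta \mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w})] - \mathbb{E}_{q(\mathbf{w}|\theta)}[\log P(\mathcal{D}|\mathbf{w})].
\end{aligned}
$$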
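The reason the evidence term disappears from this minimisation can be made explicit. The expansion below is a standard application of Bayes' rule, added here only as a reading aid; it uses no notation beyond that already introduced.

```latex
\begin{aligned}
\mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w}|\mathcal{D})]
&= \int q(\mathbf{w}|\theta) \log \frac{q(\mathbf{w}|\theta)\, P(\mathcal{D})}{P(\mathbf{w})\, P(\mathcal{D}|\mathbf{w})} \, d\mathbf{w} \\
&= \mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w})]
 - \mathbb{E}_{q(\mathbf{w}|\theta)}[\log P(\mathcal{D}|\mathbf{w})]
 + \log P(\mathcal{D}).
\end{aligned}
```

Since $\log P(\mathcal{D})$ does not depend on $\theta$, dropping it leaves the minimiser unchanged.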

The resulting cost function is variously known as the
variational free energy (Neal and Hinton, 1998; Yedidia et al.,
2000; Friston et al., 2007) or the expected lower bound
(Saul et al., 1996; Neal and Hinton, 1998; Jaakkola and
Jordan, 2000). For simplicity we shall denote it as

$$
\mathcal{F}(\mathcal{D}, \theta) = \mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w})] - \mathbb{E}_{q(\mathbf{w}|\theta)}[\log P(\mathcal{D}|\mathbf{w})]. \tag{1}
$$

The cost function of (1) is a sum of a data-dependent part,
which we shall refer to as the likelihood cost, and a
prior-dependent part, which we shall refer to as the complexity
cost. The cost function embodies a trade-off between
satisfying the complexity of the data $\mathcal{D}$ and satisfying the
simplicity prior $P(\mathbf{w})$. (1) is also readily given an information
theoretic interpretation as a minimum description length
cost (Hinton and Van Camp, 1993; Graves, 2011). Exactly
minimising this cost naively is computationally prohibitive.
Instead gradient descent and various approximations are
used.

3.1. Unbiased Monte Carlo gradients

Under certain conditions, the derivative of an expectation
can be expressed as the expectation of a derivative:

Proposition 1. Let $\epsilon$ be a random variable having a
probability density given by $q(\epsilon)$ and let $\mathbf{w} = t(\theta, \epsilon)$ where
$t(\theta, \epsilon)$ is a deterministic function. Suppose further that
the marginal probability density of $\mathbf{w}$, $q(\mathbf{w}|\theta)$, is such that
$q(\epsilon)\,d\epsilon = q(\mathbf{w}|\theta)\,d\mathbf{w}$. Then for a function $f$ with
derivatives in $\mathbf{w}$:

$$
\frac{\partial}{\partial\theta} \mathbb{E}_{q(\mathbf{w}|\theta)}[f(\mathbf{w}, \theta)]
= \mathbb{E}_{q(\epsilon)}\left[ \frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\partial \mathbf{w}}{\partial\theta} + \frac{\partial f(\mathbf{w}, \theta)}{\partial\theta} \right].
$$

Proof.

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathbb{E}_{q(\mathbf{w}|\theta)}[f(\mathbf{w}, \theta)]
&= \frac{\partial}{\partial\theta} \int f(\mathbf{w}, \theta)\, q(\mathbf{w}|\theta)\, d\mathbf{w} \\
&= \frac{\partial}{\partial\theta} \int f(\mathbf{w}, \theta)\, q(\epsilon)\, d\epsilon \\
&= \mathbb{E}_{q(\epsilon)}\left[ \frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\partial \mathbf{w}}{\partial\theta} + \frac{\partial f(\mathbf{w}, \theta)}{\partial\theta} \right].
\end{aligned}
$$

□

The deterministic function $t(\theta, \epsilon)$ transforms a sample of
parameter-free noise $\epsilon$ and the variational posterior
parameters $\theta$ into a sample from the variational posterior. Below
we shall see how this transform works in practice for the
Gaussian case.

We apply Proposition 1 to the optimisation problem in
(1): let $f(\mathbf{w}, \theta) = \log q(\mathbf{w}|\theta) - \log P(\mathbf{w})P(\mathcal{D}|\mathbf{w})$.
Using Monte Carlo sampling to evaluate the expectations,
a backpropagation-like (LeCun, 1985; Rumelhart et al.,
1988) algorithm is obtained for variational Bayesian
inference in neural networks - Bayes by Backprop - which uses
unbiased estimates of gradients of the cost in (1) to learn a
distribution over the weights of a neural network.
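Proposition 1 can be checked numerically on a case where the answer is known. The sketch below is illustrative only: it assumes $f(w) = w^2$ and the transform $w = \mu + \sigma\epsilon$ with unit Gaussian noise, for which $\mathbb{E}[f] = \mu^2 + \sigma^2$ and so the exact gradients are $2\mu$ and $2\sigma$. The function name and defaults are hypothetical.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[w^2] where w = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # parameter-free noise
        w = mu + sigma * eps        # w = t(theta, eps)
        df_dw = 2.0 * w             # derivative of f(w) = w^2
        grad_mu += df_dw * 1.0      # dw/dmu = 1
        grad_sigma += df_dw * eps   # dw/dsigma = eps
    # f has no direct dependence on theta here, so that term is zero.
    return grad_mu / n_samples, grad_sigma / n_samples
```

For `mu = 1.5` and `sigma = 0.5` the two averages should land close to 3.0 and 1.0, up to Monte Carlo error.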


Proposition 1 is a generalisation of the Gaussian
re-parameterisation trick (Opper and Archambeau, 2009;
Kingma and Welling, 2014; Rezende et al., 2014) used for
latent variable models, applied to Bayesian learning of
neural networks. Our work differs from this previous work in
several significant ways. Bayes by Backprop operates on
weights (of which there are a great many), whilst most
previous work applies this method to learning distributions on
stochastic hidden units (of which there are far fewer than
the number of weights). Titsias and Lázaro-Gredilla (2014)
considered a large-scale logistic regression task. Unlike
previous work, we do not use the closed form of the
complexity cost (or entropic part): not requiring a closed form
of the complexity cost allows many more combinations of
prior and variational posterior families. Indeed this scheme
is also simple to implement and allows prior/posterior
combinations to be interchanged. We approximate the exact
cost (1) as:

$$
\mathcal{F}(\mathcal{D}, \theta) \approx \sum_{i=1}^{n} \log q(\mathbf{w}^{(i)}|\theta) - \log P(\mathbf{w}^{(i)}) - \log P(\mathcal{D}|\mathbf{w}^{(i)}) \tag{2}
$$

where $\mathbf{w}^{(i)}$ denotes the $i$th Monte Carlo sample drawn from
the variational posterior $q(\mathbf{w}^{(i)}|\theta)$. Note that every term of
this approximate cost depends upon the particular weights
drawn from the variational posterior: this is an instance of
a variance reduction technique known as common random
numbers (Owen, 2013). In previous work, where a closed
form complexity cost or closed form entropy term are used,
part of the cost is sensitive to particular draws from the
posterior, whilst the closed form part is oblivious. Since
each additive term in the approximate cost in (2) uses the
same weight samples, the gradients of (2) are only affected
by the parts of the posterior distribution characterised by
the weight samples. In practice, we did not find this to
perform better than using a closed form KL (where it could
be computed), but we did not find it to perform worse. In
our experiments, we found that a prior without an
easy-to-compute closed form complexity cost performed best.
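The sampled complexity cost can be compared against a closed form in the one case where both are easy to write down. The sketch below is an illustration under assumptions not made in the text: a single scalar weight, a Gaussian variational posterior and a zero-mean Gaussian prior with hypothetical standard deviation `prior_std`. It averages over samples (dividing by the sample count) so that the two quantities are directly comparable.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    return (-0.5 * math.log(2.0 * math.pi * std ** 2)
            - (x - mean) ** 2 / (2.0 * std ** 2))

def sampled_complexity_cost(mu, sigma, prior_std, n_samples=20_000, seed=0):
    """Average of log q(w|theta) - log P(w) over samples w ~ q(w|theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = mu + sigma * rng.gauss(0.0, 1.0)
        total += log_normal_pdf(w, mu, sigma) - log_normal_pdf(w, 0.0, prior_std)
    return total / n_samples

def closed_form_kl(mu, sigma, prior_std):
    """KL[N(mu, sigma^2) || N(0, prior_std^2)], available only for this pairing."""
    return (math.log(prior_std / sigma)
            + (sigma ** 2 + mu ** 2) / (2.0 * prior_std ** 2) - 0.5)
```

The sampled version needs only the two log-densities, which is why the prior can be swapped for one with no closed-form KL.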
3.2. Gaussian variational posterior

Suppose that the variational posterior is a diagonal Gaussian
distribution, then a sample of the weights $\mathbf{w}$ can be
obtained by sampling a unit Gaussian, shifting it by a mean
$\mu$ and scaling by a standard deviation $\sigma$. We parameterise
the standard deviation pointwise as $\sigma = \log(1 + \exp(\rho))$
and so $\sigma$ is always non-negative. The variational posterior
parameters are $\theta = (\mu, \rho)$. Thus the transform from a
sample of parameter-free noise and the variational posterior
parameters that yields a posterior sample of the weights $\mathbf{w}$ is:
$\mathbf{w} = t(\theta, \epsilon) = \mu + \log(1 + \exp(\rho)) \circ \epsilon$ where $\circ$ is
pointwise multiplication. Each step of optimisation proceeds as
follows:

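1. Sample $\epsilon \sim \mathcal{N}(0, I)$.

2. Let $\mathbf{w} = \mu + \log(1 + \exp(\rho)) \circ \epsilon$.

3. Let $\theta = (\mu, \rho)$.

4. Let $f(\mathbf{w}, \theta) = \log q(\mathbf{w}|\theta) - \log P(\mathbf{w})P(\mathcal{D}|\mathbf{w})$.

5. Calculate the gradient with respect to the mean

$$
\Delta_\mu = \frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} + \frac{\partial f(\mathbf{w}, \theta)}{\partial \mu}. \tag{3}
$$

6. Calculate the gradient with respect to the standard
deviation parameter $\rho$

$$
\Delta_\rho = \frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\epsilon}{1 + \exp(-\rho)} + \frac{\partial f(\mathbf{w}, \theta)}{\partial \rho}. \tag{4}
$$

7. Update the variational parameters:

$$
\mu \leftarrow \mu - \alpha \Delta_\mu \tag{5}
$$

$$
\rho \leftarrow \rho - \alpha \Delta_\rho. \tag{6}
$$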

Note that the $\partial f(\mathbf{w}, \theta) / \partial \mathbf{w}$ terms of the gradients
for the mean and standard deviation are shared and are
exactly the gradients found by the usual backpropagation
algorithm on a neural network. Thus, remarkably, to learn
both the mean and the standard deviation we must simply
calculate the usual gradients found by backpropagation,
and then scale and shift them as above.
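The seven steps above can be traced on a model small enough to differentiate by hand. The sketch below is not the authors' implementation: it assumes a single scalar weight, a toy likelihood $y = wx$ plus unit-variance Gaussian noise, and an $\mathcal{N}(0, 1)$ prior, all chosen only so that every derivative fits on one line. The names `bbb_step` and `softplus` are hypothetical.

```python
import math
import random

def softplus(rho):
    return math.log1p(math.exp(rho))

def bbb_step(mu, rho, xs, ys, lr, rng):
    """One update of (mu, rho) for the toy model described in the lead-in."""
    eps = rng.gauss(0.0, 1.0)                        # step 1: sample noise
    sigma = softplus(rho)
    w = mu + sigma * eps                             # step 2: sample the weight
    dsigma_drho = 1.0 / (1.0 + math.exp(-rho))
    # Step 4: f = log q(w|theta) - log P(w) - log P(D|w), differentiated by hand.
    dlogq_dw = -(w - mu) / sigma ** 2
    dlogprior_dw = -w                                # N(0, 1) prior (assumed)
    dloglik_dw = sum((y - w * x) * x for x, y in zip(xs, ys))
    df_dw = dlogq_dw - dlogprior_dw - dloglik_dw     # shared backprop-style term
    df_dmu = (w - mu) / sigma ** 2                   # direct dependence of log q on mu
    df_drho = (-1.0 / sigma + (w - mu) ** 2 / sigma ** 3) * dsigma_drho
    grad_mu = df_dw + df_dmu                         # step 5
    grad_rho = df_dw * eps * dsigma_drho + df_drho   # step 6
    return mu - lr * grad_mu, rho - lr * grad_rho    # step 7
```

For this conjugate toy model the exact posterior is Gaussian with precision $1 + \sum_i x_i^2$, so repeated calls with a small learning rate should drive `mu` and `softplus(rho)` towards the exact posterior mean and standard deviation.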

3.3. Scale mixture prior 

Having liberated our algorithm from the confines of Gaussian
priors and posteriors, we propose a simple scale mixture
prior combined with a diagonal Gaussian posterior.
The diagonal Gaussian posterior is largely free from
numerical issues, and two degrees of freedom per weight only
increases the number of parameters to optimise by a factor
of two, whilst giving each weight its own quantity of
uncertainty.

We pick a fixed-form prior and do not adjust its
hyperparameters during training, instead picking them by
cross-validation where possible. Empirically we found
optimising the parameters of a prior $P(\mathbf{w})$ (by taking
derivatives of (1)) to not be useful, and to yield worse results.
Graves (2011) and Titsias and Lázaro-Gredilla (2014)
propose closed form updates of the prior hyperparameters.
Changing the prior based upon the data that it is meant to
regularise is known as empirical Bayes and there is much
debate as to its validity (Gelman, 2008). A reason why it
fails for Bayes by Backprop is as follows: it can be easier
to change the prior parameters (of which there are few)
than it is to change the posterior parameters (of which there
are many) and so very quickly the prior parameters try to
capture the empirical distribution of the weights at the
beginning of learning. Thus the prior learns to fit poor initial
parameters quickly, and makes the cost in (1) less willing
to move away from poor initial parameters. This can yield
slow convergence, introduce strange local minima and
result in poor performance.

We propose using a scale mixture of two Gaussian densities
as the prior. Each density is zero mean, but with differing
variances:

$$
P(\mathbf{w}) = \prod_j \pi \mathcal{N}(\mathbf{w}_j|0, \sigma_1^2) + (1 - \pi) \mathcal{N}(\mathbf{w}_j|0, \sigma_2^2), \tag{7}
$$

where $\mathbf{w}_j$ is the $j$th weight of the network, $\mathcal{N}(x|\mu, \sigma^2)$ is
the Gaussian density evaluated at $x$ with mean $\mu$ and
variance $\sigma^2$, and $\sigma_1^2$ and $\sigma_2^2$ are the variances of the mixture
components. The first mixture component of the prior is
given a larger variance than the second, $\sigma_1 > \sigma_2$,
providing a heavier tail in the prior density than a plain Gaussian
prior. The second mixture component has a small variance
$\sigma_2 \ll 1$ causing many of the weights to a priori tightly
concentrate around zero. Our prior resembles a spike-and-slab
prior (Mitchell and Beauchamp, 1988; George and
McCulloch, 1993; Chipman, 1996), where instead all the prior
parameters are shared among all the weights. This makes the
prior more amenable to use during optimisation by
stochastic gradient descent and avoids the need for prior parameter
optimisation based upon training data.
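The log-density of such a prior is straightforward to evaluate directly. The sketch below is a minimal illustration: the default values for `pi`, `sigma1` and `sigma2` are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

def normal_pdf(x, std):
    """Zero-mean Gaussian density with standard deviation `std`."""
    return math.exp(-x * x / (2.0 * std ** 2)) / (std * math.sqrt(2.0 * math.pi))

def log_scale_mixture_prior(weights, pi=0.5, sigma1=1.0, sigma2=math.exp(-6)):
    """log P(w) = sum_j log(pi N(w_j|0, sigma1^2) + (1 - pi) N(w_j|0, sigma2^2)).

    Default hyperparameters are illustrative assumptions. For weights far in the
    tail both densities can underflow; a log-sum-exp form would be more robust.
    """
    total = 0.0
    for w in weights:
        mixture = pi * normal_pdf(w, sigma1) + (1.0 - pi) * normal_pdf(w, sigma2)
        total += math.log(mixture)
    return total
```

Setting `pi = 1.0` recovers a plain Gaussian prior, which gives a simple way to compare the two priors within the same training code.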

3.4. Minibatches and KL re-weighting 

As several authors have noted, the cost in (1) is amenable
to minibatch optimisation, often used with neural networks:
for each epoch of optimisation the training data $\mathcal{D}$ is
randomly split into a partition of $M$ equally-sized subsets,
$\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_M$. Each gradient is averaged over all
elements in one of these minibatches; a trade-off between a
fully batched gradient descent and a fully stochastic
gradient descent. Graves (2011) proposes minimising the
minibatch cost for minibatch $i = 1, 2, \ldots, M$:

$$
\mathcal{F}_i^{EQ}(\mathcal{D}_i, \theta) = \frac{1}{M} \mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w})] - \mathbb{E}_{q(\mathbf{w}|\theta)}[\log P(\mathcal{D}_i|\mathbf{w})]. \tag{8}
$$

This is equivalent to the cost in (1) since
$\sum_i \mathcal{F}_i^{EQ}(\mathcal{D}_i, \theta) = \mathcal{F}(\mathcal{D}, \theta)$. There are many ways to weight the complexity
cost relative to the likelihood cost on each minibatch. For
example, if minibatches are partitioned uniformly at
random, the KL cost can be distributed non-uniformly among
the minibatches at each epoch. Let $\pi \in [0, 1]^M$ and
$\sum_{i=1}^{M} \pi_i = 1$, and define:

$$
\mathcal{F}_i^{\pi}(\mathcal{D}_i, \theta) = \pi_i \mathrm{KL}[q(\mathbf{w}|\theta) \,\|\, P(\mathbf{w})] - \mathbb{E}_{q(\mathbf{w}|\theta)}[\log P(\mathcal{D}_i|\mathbf{w})]. \tag{9}
$$

Then $\mathbb{E}_M[\sum_{i=1}^{M} \mathcal{F}_i^{\pi}(\mathcal{D}_i, \theta)] = \mathcal{F}(\mathcal{D}, \theta)$ where $\mathbb{E}_M$ denotes
an expectation over the random partitioning of minibatches.
In particular, we found the scheme $\pi_i = \frac{2^{M-i}}{2^M - 1}$ to work
well: the first few minibatches are heavily influenced by
the complexity cost, whilst the later minibatches are largely
influenced by the data. At the beginning of learning this is
particularly useful, as for the first few minibatches changes
in the weights due to the data are slight; as more data
are seen, data become more influential and the prior less
influential.
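A weighting of this kind is easy to compute and to sanity-check. The sketch below assumes geometric weights $\pi_i = 2^{M-i} / (2^M - 1)$, one scheme that matches the description above (early minibatches dominated by the complexity cost); the function names are hypothetical.

```python
def kl_weights(num_minibatches):
    """Geometric weights pi_i = 2^(M - i) / (2^M - 1) for i = 1..M.

    Written as 2^(-i) / (1 - 2^(-M)), which is algebraically identical but
    avoids overflow when M is large.
    """
    m = num_minibatches
    norm = 1.0 - 0.5 ** m
    return [0.5 ** i / norm for i in range(1, m + 1)]

def minibatch_cost(pi_i, kl, expected_log_lik):
    """Per-minibatch cost: pi_i * KL - E_q[log P(D_i | w)]."""
    return pi_i * kl - expected_log_lik
```

Because the weights sum to one, adding the per-minibatch costs over an epoch recovers the full KL term exactly once alongside the summed log-likelihoods.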

4. Contextual Bandits 

Contextual bandits are simple reinforcement learning
problems without persistent state (Li et al., 2010; Filippi et al.,
2010). At each step an agent is presented with a context
$x$ and a choice of one of $K$ possible actions $a$. Different
actions yield different unknown rewards $r$. The agent must
pick the action that yields the highest expected reward. The
context is assumed to be presented independent of any
previous actions, rewards or contexts.

An agent builds a model of the distribution of the rewards
conditioned upon the action and the context: $P(r|x, a, \mathbf{w})$.
It then uses this model to pick its action. Note, importantly,
that an agent does not know what reward it could have
received for an action that it did not pick, a difficulty often
known as “the absence of counterfactual”. As the agent’s
model $P(r|x, a, \mathbf{w})$ is trained online, based upon the
actions chosen, unless exploratory actions are taken, the agent
may perform suboptimally.

4.1. Thompson Sampling for Neural Networks 

As in Section 2, $P(r|x, a, \mathbf{w})$ can be modelled by a neural
network where $\mathbf{w}$ are the weights of the neural network.
However if this network is simply fit to observations and
the action with the highest expected reward taken at each
time, the agent can under-explore, as it may miss more
rewarding actions.*

Thompson sampling (Thompson, 1933) is a popular means
of picking an action that trades off between exploitation
(picking the best known action) and exploration (picking
what might be a suboptimal arm to learn more).
Thompson sampling usually necessitates a Bayesian treatment of
the model parameters. At each step, Thompson sampling
draws a new set of parameters and then picks the action
relative to those parameters. This can be seen as a kind
of stochastic hypothesis testing: more probable
parameters are drawn more often and thus refuted or confirmed
the fastest. More concretely Thompson sampling proceeds
as follows:

1. Sample a new set of parameters for the model. 

2. Pick the action with the highest expected reward ac¬ 
cording to the sampled parameters. 

3. Update the model. Go to 1. 

There is an increasing literature concerning the efficacy and 
justification of this means of exploration (Chapelle and Li, 
2011; May et al., 2012; Kaufmann et al., 2012; Agrawal 
and Goyal, 2012; 2013). Thompson sampling is easily 
adapted to neural networks using the variational posterior 
found in Section 3: 

1. Sample weights from the variational posterior:
$\mathbf{w} \sim q(\mathbf{w}|\theta)$.

2. Receive the context $x$.

3. Pick the action $a$ that maximises $\mathbb{E}_{P(r|x, a, \mathbf{w})}[r]$.

4. Receive reward $r$.

5. Update variational parameters $\theta$ according to
Section 3. Go to 1.
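The loop structure can be illustrated without a neural network. The sketch below swaps in a much simpler model, and should be read as such: each action has an independent Gaussian posterior over its mean reward, updated in closed form, and there is no context. Rewards are treated as quantities to maximise, in line with picking the action with the highest expected reward. The function name and the unit-variance reward noise are assumptions.

```python
import random

def thompson_bandit(true_means, steps=2000, seed=0):
    """Thompson sampling with an independent Gaussian posterior per action.

    A conjugate stand-in for the variational posterior; not a neural network.
    Returns how often each action was chosen.
    """
    rng = random.Random(seed)
    k = len(true_means)
    post_mean = [0.0] * k    # posterior mean of each action's reward
    post_prec = [1.0] * k    # posterior precision; the prior is N(0, 1)
    counts = [0] * k
    for _ in range(steps):
        # Sample one set of parameters from the current posterior.
        sampled = [rng.gauss(post_mean[a], post_prec[a] ** -0.5) for a in range(k)]
        # Pick the action with the highest reward under the sampled parameters.
        action = max(range(k), key=lambda a: sampled[a])
        # Observe a noisy reward (unit-variance Gaussian noise, assumed).
        reward = rng.gauss(true_means[action], 1.0)
        # Conjugate update of the chosen action's posterior only.
        new_prec = post_prec[action] + 1.0
        post_mean[action] = (post_prec[action] * post_mean[action] + reward) / new_prec
        post_prec[action] = new_prec
        counts[action] += 1
    return counts
```

Early on the wide posteriors make the choice close to uniform; as precision accumulates, the samples concentrate and the best action dominates the counts.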

Note that it is possible, as mentioned in Section 3.1, to
decrease the variance of the gradient estimates, trading off for
reduced exploration, by using more than one Monte Carlo
sample, using the corresponding networks as an ensemble
and picking the action by maximising the average of the
expectations.

Initially the variational posterior will be close to the prior,
and actions will be picked uniformly. As the agent takes
actions, the variational posterior will begin to converge, and
uncertainty on many parameters can decrease, and so
action selection will become more deterministic, focusing on
the high expected reward actions discovered so far. It is
known that variational methods under-estimate uncertainty
(Minka, 2001; 2005; Bishop, 2006) which could lead to
under-exploration and premature convergence, but we did
not find this in practice.

* Interestingly, depending upon how $\mathbf{w}$ are initialised and the
mean of prior used during MAP inference, it is sometimes
possible to obtain another heuristic for the exploration-exploitation
trade-off: optimism-under-uncertainty. We leave this for future
investigation.

Table 1. Classification Error Rates on MNIST. * indicates result
used an ensemble of 5 networks.

Method                                        # Units/Layer  # Weights  Test Error
SGD, no regularisation (Simard et al., 2003)  800            1.3m       1.6%
SGD, dropout (Hinton et al., 2012)                                      ≈ 1.3%
SGD, dropconnect (Wan et al., 2013)           800            1.3m       1.2%*
SGD                                           400            500k       1.83%
                                              800            1.3m       1.84%
                                              1200           2.4m       1.88%
SGD, dropout                                  400            500k       1.51%
                                              800            1.3m       1.33%
                                              1200           2.4m       1.36%
Bayes by Backprop, Gaussian                   400            500k       1.82%
                                              800            1.3m       1.99%
                                              1200           2.4m       2.04%
Bayes by Backprop, Scale mixture              400            500k       1.36%
                                              800            1.3m       1.34%
                                              1200           2.4m       1.32%

Figure 2. Test error on MNIST as training progresses.

5. Experiments 

We present some empirical evaluation of the methods
proposed above: on MNIST classification, on a non-linear
regression task, and on a contextual bandits task.

5.1. Classification on MNIST 

We trained networks of various sizes on the MNIST digits
dataset (LeCun and Cortes, 1998), consisting of 60,000
training and 10,000 testing pixel images of size 28 by 28.
Each image is labelled with its corresponding number
(between zero and nine, inclusive). We preprocessed the pixels
by dividing values by 126. Many methods have been
proposed to improve results on MNIST: generative
pre-training, convolutions, distortions, etc. Here we shall focus
on improving the performance of an ordinary feedforward
neural network without using any of these methods. We
used a network of two hidden layers of rectified linear units
(Nair and Hinton, 2010; Glorot et al., 2011), and a softmax
output layer with 10 units, one for each possible label.

According to Hinton et al. (2012), the best published
feedforward neural network classification result on MNIST
(excluding those using data set augmentation, convolutions,
etc.) is 1.6% (Simard et al., 2003), whilst dropout with
an L2 regulariser attains errors around 1.3%. Results from
Bayes by Backprop are shown in Table 1, for various sized
networks, using either a Gaussian or Gaussian scale
mixture prior. Performance is comparable to that of dropout,
perhaps slightly better, as can also be seen in Figure 2. Note that
we trained on 50,000 digits and used 10,000 digits as a
validation set, whilst Hinton et al. (2012) trained on 60,000
digits and did not use a validation set. We used the
validation set to pick the best hyperparameters (learning rate,
number of gradients to average) and so we also repeated
this protocol for dropout and SGD (Stochastic Gradient
Descent on the MLE objective in Section 2). We considered
learning rates of $10^{-3}$, $10^{-4}$ and $10^{-5}$ with minibatches
of size 128. For Bayes by Backprop, we averaged over
either 1, 2, 5, or 10 samples and considered $\pi \in \{\frac{1}{4}, \frac{1}{2}, \frac{3}{4}\}$,
$-\log \sigma_1 \in \{0, 1, 2\}$ and $-\log \sigma_2 \in \{6, 7, 8\}$.

Figure 3. Histogram of the trained weights of the neural network,
for Dropout, plain SGD, and samples from Bayes by Backprop.

Figure 2 shows the learning curves on the test set for Bayes
by Backprop, dropout and SGD on a network with two
layers of 1200 rectified linear units. As can be seen, SGD
converges the quickest, initially obtaining a low test
error and then overfitting. Bayes by Backprop and dropout
converge at similar rates (although each iteration of Bayes
by Backprop is more expensive than dropout - around two
times slower). Eventually Bayes by Backprop converges
on a better test error than dropout after 600 epochs.

Figure 3 shows density estimates of the weights. The Bayes
by Backprop weights are sampled from the variational
posterior, and the dropout weights are those used at test time.
Interestingly the regularised networks found by dropout
and Bayes by Backprop have a greater range of weights, with
fewer centred at zero, than those found by SGD. Bayes by
Backprop uses the greatest range of weights.




Figure 4. Density and CDF of the Signal-to-Noise ratio over all 
weights in the network. The red line denotes the 75% cut-off. 

In Table 2, we examine the effect of replacing the
variational posterior on some of the weights with a constant
zero, so as to determine the level of redundancy in the
network found by Bayes by Backprop. We took a Bayes
by Backprop trained network with two layers of 1200
units† and ordered the weights by their signal-to-noise
ratio ($|\mu_i| / \sigma_i$). We removed the weights with the lowest
signal-to-noise ratio. As can be seen in Table 2, even when
95% of the weights are removed the network still performs
well, with a significant drop in performance once 98% of
the weights have been removed.

In Figure 4 we examined the distribution of the
signal-to-noise ratio relative to the cut-off in the network used in Table 2.
The lower plot shows the cumulative distribution of
signal-to-noise ratio, whilst the top plot shows the density. From
the density plot we see there are two modalities of
signal-to-noise ratios, and from the CDF we see that the 75%
cut-off separates these two peaks. These two peaks
coincide with a drop in performance in Table 2 from 1.24%
to 1.29%, suggesting that the signal-to-noise heuristic is in
fact related to the test performance.

†We used a network from the end of training rather than
picking a network with a low validation cost found during training,
hence the disparity with results in Table 1. The lowest test error
observed was 1.12%.


Table 2. Classification Errors after Weight pruning

Proportion removed  # Weights  Test Error
0%                  2.4m       1.24%
50%                 1.2m       1.24%
75%                 600k       1.24%
95%                 120k       1.29%
98%                 48k        1.39%


It is interesting to contrast this weight removal approach
to obtaining a fast, smaller, sparse network for prediction
after training with the approach taken by distillation
(Hinton et al., 2014) which requires an extra stage of training
to obtain a compressed prediction model. As with
distillation, our method begins with an ensemble (one for each
possible assignment of the weights). However, unlike
distillation, we can simply obtain a subset of this ensemble by
using the probabilistic properties of the weight distributions
learnt to gracefully prune the ensemble down into a smaller
network. Thus even though networks trained by Bayes by
Backprop may have twice as many weights, the number of
parameters that actually need to be stored at run time can be
far fewer. Graves (2011) also considered pruning weights
using the signal-to-noise ratio, but demonstrated results on
a network 20 times smaller and did not prune as high a
proportion of weights (at most 11%) whilst still
maintaining good test performance. The scale mixture prior used
by Bayes by Backprop encourages a broad spread of the
weights. Many of these weights can be successfully pruned
without impacting performance significantly.
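Pruning by signal-to-noise ratio amounts to a sort and a threshold. The sketch below is a minimal illustration on flat lists of posterior means and standard deviations; the function name and the choice to return the pruned means (with removed weights set to zero) are assumptions about how one might organise it.

```python
def prune_by_snr(mus, sigmas, proportion):
    """Set to zero the given proportion of weights with the lowest |mu| / sigma."""
    snr = [abs(m) / s for m, s in zip(mus, sigmas)]
    n_remove = int(proportion * len(mus))
    order = sorted(range(len(mus)), key=lambda i: snr[i])   # lowest ratio first
    removed = set(order[:n_remove])
    return [0.0 if i in removed else m for i, m in enumerate(mus)]
```

Only the surviving means need to be stored for prediction, which is where the run-time saving comes from.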

5.2. Regression curves 

We generated training data from the curve:

$$
y = x + 0.3 \sin(2\pi(x + \epsilon)) + 0.3 \sin(4\pi(x + \epsilon)) + \epsilon
$$

where $\epsilon \sim \mathcal{N}(0, 0.02)$. Figure 5 shows two examples of
fitting a neural network to these data, minimising a
conditional Gaussian loss. Note that in the regions of the input
space where there are no data, the ordinary neural network
reduces the variance to zero and chooses to fit a particular
function, even though there are many possible
extrapolations of the training data. On the left, Bayesian model
averaging affects predictions: where there are no data, the
confidence intervals diverge, reflecting there being many
possible extrapolations. In this case Bayes by Backprop
prefers to be uncertain where there are no nearby data, as
opposed to a standard neural network which can be overly
confident.
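Training data of this form can be generated in a few lines. The sketch below makes two assumptions that the text does not settle: the 0.02 in the noise distribution is read as a variance, and inputs are drawn uniformly from a hypothetical range `[x_low, x_high]`.

```python
import math
import random

def curve(x, eps):
    """y = x + 0.3 sin(2 pi (x + eps)) + 0.3 sin(4 pi (x + eps)) + eps."""
    return (x + 0.3 * math.sin(2.0 * math.pi * (x + eps))
            + 0.3 * math.sin(4.0 * math.pi * (x + eps)) + eps)

def make_training_data(n, x_low=0.0, x_high=0.5, noise_var=0.02, seed=0):
    """Draw n (x, y) pairs; the input range and variance reading are assumptions."""
    rng = random.Random(seed)
    std = math.sqrt(noise_var)
    data = []
    for _ in range(n):
        x = rng.uniform(x_low, x_high)
        data.append((x, curve(x, rng.gauss(0.0, std))))
    return data
```

Note that the same noise draw enters both inside the sines and additively, as in the formula above.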

Figure 5. Regression of noisy data with interquartile ranges. Black
crosses are training samples. Red lines are median predictions.
Blue/purple region is interquartile range. Left: Bayes by Backprop
neural network, Right: standard neural network.

5.3. Bandits on Mushroom Task

Figure 6. Comparison of cumulative regret of various agents on
the mushroom bandit task, averaged over five runs. Lower is
better.

We take the UCI Mushrooms data set (Bache and Lichman,
2013), and cast it as a bandit task, similar to Guez (2015,
Chapter 6). Each mushroom has a set of features, which we
treat as the context for the bandit, and is labelled as edible
or poisonous. An agent can either eat or not eat a
mushroom. If an agent eats an edible mushroom, then it receives
a reward of 5. If an agent eats a poisonous mushroom, then
with probability $\frac{1}{2}$ it receives a reward of -35, otherwise
a reward of 5. If an agent elects not to eat a mushroom,
it receives a reward of 0. Thus an agent expects to receive
a reward of 5 for eating an edible mushroom, but an expected
reward of -15 for eating a poisonous mushroom.
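The reward structure and its expected values can be written out directly. The sketch below assumes the penalty for eating a poisonous mushroom occurs with probability 1/2, which is the value consistent with the stated expected reward of -15; the function names are hypothetical.

```python
import random

def mushroom_reward(eat, edible, rng):
    """Sampled reward for one interaction with the mushroom bandit."""
    if not eat:
        return 0
    if edible:
        return 5
    # Poisonous: -35 or +5 with equal probability (assumed probability 1/2).
    return -35 if rng.random() < 0.5 else 5

def expected_reward(eat, edible):
    """Expected reward of each (action, label) combination."""
    if not eat:
        return 0.0
    return 5.0 if edible else 0.5 * (-35) + 0.5 * 5
```

The expected value for eating a poisonous mushroom works out as 0.5 × (-35) + 0.5 × 5 = -15, so declining is the better action in that case.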

Regret measures the difference between the reward
achievable by an oracle and the reward received by an agent. In
this case, an oracle will always receive a reward of 5 for an
edible mushroom, or 0 for a poisonous mushroom. We take
the cumulative sum of regret of several agents and show
them in Figure 6. Each agent uses a neural network with
two hidden layers of 100 rectified linear units. The input
to the network is a vector consisting of the mushroom
features (context) and a one-of-K encoding of the action. The
output of the network is a single scalar, representing the
expected reward of the given action in the given context. For
Bayes by Backprop, we sampled the weights twice and
averaged two of these outputs to obtain the expected reward
for action selection. We kept the last 4096 reward, context
and action tuples in a buffer, and trained the networks
using randomly drawn minibatches of size 64 for 64 training
steps (64 × 64 = 4096) per interaction with the Mushroom
bandit. A common heuristic for trading off exploration vs.
exploitation is to follow an ε-greedy policy: with
probability ε propose a uniformly random action, otherwise pick
the best action according to the neural network.
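An ε-greedy policy takes only a few lines. The sketch below is illustrative: it assumes the model's expected rewards for each action are already available as a list, and the function name is hypothetical.

```python
import random

def epsilon_greedy(expected_rewards, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(expected_rewards))
    return max(range(len(expected_rewards)), key=lambda a: expected_rewards[a])
```

With `epsilon = 0` this is the pure greedy agent; unlike Thompson sampling, the amount of exploration is fixed in advance and does not shrink as the model becomes more certain.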

Figure 6 compares a Bayes by Backprop agent with three
ε-greedy agents, for values of ε of 0% (pure greedy), 1%,
and 5%. An ε of 5% appears to over-explore, whereas a
purely greedy agent does poorly at the beginning,
greedily electing to eat nothing, but then does much better once
it has seen enough data. It seems that non-local function
approximation updates allow the greedy agent to explore,
as for the first 1,000 steps, the agent eats nothing but after
approximately 1,000 the greedy agent suddenly decides to
eat mushrooms. The Bayes by Backprop agent explores
from the beginning, both eating and ignoring mushrooms
and quickly converges on eating and non-eating with an
almost perfect rate (hence the almost flat regret).

6. Discussion 

We introduced a new algorithm for learning neural
networks with uncertainty on the weights called Bayes by
Backprop. It optimises a well-defined objective function
to learn a distribution on the weights of a neural network.
The algorithm achieves good results in several domains.
When classifying MNIST digits, performance from Bayes
by Backprop is comparable to that of dropout. We
demonstrated on a simple non-linear regression problem that the
uncertainty introduced allows the network to make more
reasonable predictions about unseen data. Finally, for
contextual bandits, we showed how Bayes by Backprop can
automatically learn how to trade off exploration and
exploitation. Since Bayes by Backprop simply uses gradient
updates, it can readily be scaled using multi-machine
optimisation schemes such as asynchronous SGD (Dean et al.,
2012). Furthermore, all of the operations used are readily
implemented on a GPU.